A translated corpus of 30,000 French SMS

Authors

  • Cédrick Fairon
  • Sébastien Paumier
Abstract

The development of communication technologies has contributed to the appearance of new forms of written language, which scientists have to study according to their peculiarities (typing or viewing constraints, synchronicity, etc.). In the particular case of SMS (Short Message Service), studies are complicated by a lack of data, mainly due to technical constraints and privacy considerations. In this paper, we present a corpus of 30,000 French SMS, collected through a project in Belgium named “Faites don de vos SMS à la science” (Give your SMS to Science). This corpus is unique in its quality, its size and the fact that the SMS have been manually translated into “standard” French. We will first describe the collection process and discuss the writers' profiles. Then we will explain in detail how the translation was carried out.


Similar resources

Compression textuelle sur la base de règles issues d'un corpus de sms (Textual Compression Based on Rules Arising from a Corpus of Text Messages) [in French]

The present research seeks to reduce the size of text messages on the basis of compression techniques observed mostly in a corpus of SMS. This paper explains the methodology followed to establish compression rules. It then presents the 33 considered rules, and illustrates the four suggested levels of compression with two ...


Integration of Lexical and Semantic Knowledge for Sentiment Analysis in SMS

With the explosive growth of online social media (forums, blogs, and social networks), exploitation of these new information sources has become essential. Our work is based on the sud4science project. The goal of this project is to perform multidisciplinary work on a corpus of authentic SMS, in French, collected in 2011 and anonymised (88milSMS corpus: http://88milsms.huma-num.fr). This paper h...


Biomedical Concept Recognition in French Text Using Automatic Translation of English Terms

We addressed the task of automatically recognizing and normalizing entities in a French medical corpus. To increase the coverage of our initial French terminology, English terms were translated into French by two different automatic translators. Indexing with a terminology that contained the intersection of the translated terms in combination with several post-processing steps to reduce the number ...


An Aligned French-Chinese corpus of 10K segments from university educational material

This paper describes a corpus of nearly 10K French-Chinese aligned segments, produced by postediting machine translated computer science courseware. This corpus was built from 2013 to 2016 within the MACAU project, by native Chinese students. The quality, as judged by native speakers, is adequate for understanding (far better than by reading only the original French) and for getting better mark...


Can we Relearn an RBMT System?

This paper describes SYSTRAN submissions for the shared task of the third Workshop on Statistical Machine Translation at ACL. Our main contribution consists in a French-English statistical model trained without the use of any human-translated parallel corpus. In substitution, we translated a monolingual corpus with SYSTRAN rule-based translation engine to produce the parallel corpus. The result...




Publication date: 2006